Abstract


A.I. Neural Networks (NN) constitute an interesting support for the analysis of the text-image
relationship, which, since ancient times, has stimulated essential reflections on the resulting
ontological value. Among these, Text-To-Image (TTI) networks, designed to learn the transformation
of a text into an image, are the most suitable to make an innovative contribution to this complex
investigation. In the literature, the applications of TTI are many and concern both the learning
of the network and its training in constructing images starting from ad hoc descriptions. This
paper instead investigates the training of the NN in transforming the descriptions of treatise
texts into images. This operation is complex because it requires knowledge of the two different
constructive systems of the ‘text architecture’, one governed by the grammar of the sentences and
the other by the grammar of the text, which introduces a multitude of variables that are not
easily decodable by neural networks. The contribution presents the results of the first phase of
the research with the development of a procedure based on the interception of sentences and
paradigmatic relationships to be decoded for the training of the network. The experimentation is
conducted on the treatise Le Vite by Giorgio Vasari, which contains an accurate description of
the grotesques created by the author in the church of Sant’Anna dei Lombardi in Naples, for which
graphic and geometric analyses have already been carried out.
Introduction


The oldest relationship between words and images dates back to the 2nd century when 
Hermogenes introduced ékphrasis, as descriptive speech which puts the object before the 
eyes. In particular, in modern thought, ékphrasis finds its most natural place in the verbal 
description of art, tracing “figurative and literary tòpoi (...) from the descriptions of the great 
masterpieces” [Albanese 2013, p. 2]. Over the centuries, analysing the delicate intertwining 
of text and images has aroused essential reflections on the resulting ontological value. All
images, whether graphic, optical and perceptive (more objective) or mental and verbal (less
objective), assume the same value according to Wittgenstein’s theory of images [Wittgenstein
1922] and W.J.T. Mitchell [Mitchell 1995]. If the signs of language are “images of what they
represent” [Wittgenstein 1922, p. 3], the image itself is nothing more than a graphic sign of
the object it represents [Mitchell 1995]. Images, in the broad sense of text, image or idea, 
are just semblances of the real world expressed differently. The investigation into the rela- 
tionship between ‘image and text’ finds, in the opinion of the authors, the most exhaustive 
outcome in the triple ‘typographic convention’ proposed by Mitchell himself [Mitchell 1995]: 
the ‘image/text’, with the slash, highlights a problematic void, a fracture in the representation 
of their relationship; ‘imagetext’, without glyphs, designates composite and synthetic works 
(or concepts), which combine image and text; ‘image-text’, with the insertion of the hyphen,
underlines the relationship between what is visual and verbal. The latter relationship is in- 
vestigated in this paper, which presents the first reflections on the topic with the awareness 
that the visual-verbal relationship is only a fractal of the more complex ‘imagetext’, and, 
therefore, also requires the use of different sensory channels (eye and ears), the definition 
of semiotic functions (iconic aspect and arbitrariness of the symbol), the identification of 
the cognitive modality (space-time) and the application of operative codes (analogue, digital, 
A.I.). The tool used is the artificial intelligence (A.I.) of neural networks, i.e., mathematical-IT
calculation models, which, inspired by the biological functioning of the human brain, can build
processes based on information interconnections. The science that deals with the definition
and management of interconnections is called ‘connectionism’ and is based on the Parallel
Distributed Processing (PDP) of information.
In summary, at the basis of the artificial neural network, there are algorithms which, in an
‘adaptive’ way, can connect external data (training) with internal design information (learning)
of the network itself, modifying its structure (nodes and interconnections or arcs)
from time to time. The experimentation took place on the treatise Le Vite by Giorgio Vasari,
which contains an interesting description of the grotesque iconographic apparatus he
created in the church of Sant'Anna dei Lombardi in Naples. Among the numerous neural 
networks offered by the technological market, Text-To-Image (TTI), i.e., networks capable of 
transforming a text, formulated in natural language, into an image, are the most suitable for 
the research. The choice is due to the possibility of training the NN by comparing
texts and images. In fact, significant data can be acquired from the images thanks to
the geometric and graphic analysis previously performed through a digital photogrammetric
survey [Miele et al. 2022].


A.I. neural networks. State of the art


The use of A.I. Neural Networks dates back to the early 1940s, with the
demonstration of the implementation of the algorithm underlying the Turing machine
[McCulloch and Pitts 1943]. For several years, the term neural networks (NN) has included
biological and artificial without distinction. The NNs, whose application involves many sec- 
tors of the soft and hard sciences, are intelligent systems capable of artificially reproducing 
the performance of an expert person in a specific domain of knowledge or field of activity 
and, therefore, capable of identifying the solution for any complex problem. The functionality 
of NNs is organised on three levels, involving thousands of nodes and tens of thousands of 
connections (arcs). For the various levels, the nodes and the arcs have the task of receiv-
Fig. 1. Workflow of the procedure. The graph shows the interdisciplinarity of the image-text
relationship at the basis of visual culture. The training operations of the network are
iterable according to the result obtained. Elaboration by the authors.

* Identification of the portion of text containing the description of the image to be created
* Linguistic analysis of the original text (official languages/dialects, historical period)
* Identification of image descriptors
* Keywords within descriptors (Kws)
* Transformation of keywords from the original language (Kws_o) to today's natural language (Kws_n)
* Translation of Kws_n into English
* Construction of the prompt syntax
* Net setting: aspect ratio (image proportions), steps (iterations), guidance scale (level of
  freedom), seed (randomness agent), negative prompt (negative suggestions)
* NOT ADEQUATE (return branch of the workflow)

image to be created through the A.I. This operation requires careful linguistic analysis,
contextualised to the historical period of the text to understand if the semantic units used by
the author can be assimilated to the ‘natural language’ previously defined. The second phase,
‘textual synthesis’, isolates a finite number of words (nodes) that describe the image.
They can be thought of as keywords that categorise image elements. The third phase,
‘construction of the prompt’, modulates the language, which must be transformed from the
original (the treatises) into natural language. This operation is essential to obtain a result as
close as possible to the real one. Furthermore, the NNs have been trained in English, so the
untranslatability of some words from one language to another plays an important role. The
transformation from the mother tongue (e.g., vernacular) to the natural one and, therefore,
to the English language generates various degrees of arbitrariness, which must be controlled
during the training of the network. Once the words have been decoded in this phase, the
syntax (arcs) must be constructed to identify the most suitable morphological signals.
The process generally requires several attempts before reaching the optimal solution. The
fourth phase, ‘prompt insertion in NN’, consists of processing the prompt obtained in
the previous phases using the trained neural network. In this phase, the operating parameters
of the network must be set, such as:
- ‘Aspect ratio’ is the ratio of the width to the height of an image, expressed as two numbers
separated by a colon, such as 16:9 or 1:1. For an x:y aspect ratio, the image is x units wide
and y units high;
- ‘Steps’ is the number of iterations performed by the A.I. during creation. In general, the more
steps, the better the overall image quality;
- ‘Guidance scale’ is the level of freedom (or precision) attributed to the A.I. in the training
phase, starting from the prompt. Higher values force the A.I. to follow the request more
rigorously;
- ‘Seed’ is the agent of randomness for the A.I. The same seed and the same prompt will produce
the same picture. Reusing the same seed with different prompts can produce a consistent style;
- ‘Negative prompt’ is the description of elements that must not be contained in the image. In
this way, the A.I. does not use the concepts/terms listed in the negative prompt.
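
As a minimal sketch of how the five settings listed above might be collected before a run, the following Python fragment groups them in a single record and converts an aspect ratio such as 16:9 into pixel dimensions. The names `NetSettings` and `size_from_aspect_ratio`, the 512-pixel short side and the rounding to multiples of 8 are illustrative assumptions and do not reproduce the interface of the tool used by the authors. The example values are those reported for the first attempt on the ‘horse with leaves legs’ prompt.

```python
from dataclasses import dataclass
from typing import Tuple


def size_from_aspect_ratio(ratio: str, short_side: int = 512, multiple: int = 8) -> Tuple[int, int]:
    """Turn an 'x:y' aspect ratio into (width, height) in pixels.

    Assumptions: the shorter side is fixed to `short_side`, the longer side is
    scaled proportionally with integer arithmetic, and both are rounded down
    to a multiple of `multiple`.
    """
    x_text, y_text = ratio.split(":")
    x, y = int(x_text), int(y_text)
    if x <= 0 or y <= 0:
        raise ValueError(f"invalid aspect ratio: {ratio!r}")
    smaller = min(x, y)
    width = (x * short_side // smaller) // multiple * multiple
    height = (y * short_side // smaller) // multiple * multiple
    return width, height


@dataclass(frozen=True)
class NetSettings:
    """Hypothetical container for the parameters set in the fourth phase."""
    prompt: str
    aspect_ratio: str = "1:1"
    steps: int = 50
    guidance_scale: float = 7.5
    seed: int = 0
    negative_prompt: str = ""

    def __post_init__(self) -> None:
        # Basic sanity checks on the numeric parameters.
        if self.steps < 1:
            raise ValueError("steps must be a positive integer")
        if self.guidance_scale < 0:
            raise ValueError("guidance scale must not be negative")

    @property
    def size(self) -> Tuple[int, int]:
        return size_from_aspect_ratio(self.aspect_ratio)


# Values of the first attempt reported for the horse image.
first_attempt = NetSettings(
    prompt="Horse with leaves legs",
    aspect_ratio="1:1",
    steps=50,
    guidance_scale=7.5,
    seed=7674627867078122,
)
```

Keeping the settings in one immutable record makes each attempt reproducible: the same prompt, seed and parameters can be stored next to the image they produced.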


The fifth stage, ‘image result’, shows the result of the network processing. It is subject
to the judgement of the operator, who may deem the result of the processing adequate or not.
In the latter case, the process must be repeated from the third stage: the prompt must be
reformulated by modifying the syntax, and the parameters of the neural net must be set again.
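
The feedback between the third and fifth stages can be sketched as a simple loop. The functions `generate` and `is_adequate` are hypothetical placeholders. The first stands for the call to the text-to-image network. The second stands for the operator's judgement, which in the procedure is a human decision and not an automatic test. The sketch only illustrates the control flow, under the assumption that the candidate reformulations are prepared in advance.

```python
from typing import Callable, Iterable, Optional, Tuple

Attempt = Tuple[str, dict]  # (prompt, network settings)


def run_procedure(
    attempts: Iterable[Attempt],
    generate: Callable[[str, dict], object],
    is_adequate: Callable[[object], bool],
) -> Optional[Tuple[int, Attempt, object]]:
    """Try each reformulated prompt in turn until a result is accepted.

    Returns (attempt number, accepted attempt, image), or None if every
    attempt was judged not adequate.
    """
    for number, (prompt, settings) in enumerate(attempts, start=1):
        image = generate(prompt, settings)   # fourth stage: prompt insertion in NN
        if is_adequate(image):               # fifth stage: operator's judgement
            return number, (prompt, settings), image
        # Not adequate: back to the third stage with the next reformulation.
    return None


# Stand-ins so that the sketch runs without a neural network (assumptions).
def fake_generate(prompt: str, settings: dict) -> str:
    return f"image({prompt}; steps={settings['steps']})"


def fake_operator(image: object) -> bool:
    return "fresco" in str(image)


attempts = [
    ("Horse with leaves legs", {"steps": 50, "guidance_scale": 7.5}),
    ("Horse with leaves legs, fresco", {"steps": 100, "guidance_scale": 17.5}),
]
result = run_procedure(attempts, fake_generate, fake_operator)
```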

The procedure was applied to the case study of Vasari’s grotesques in the sacristy of 
Sant'Anna dei Lombardi in Naples, described by the author in his treatise Le Vite. 


Case study. The images of the grotesques from Vasari’s treatise, Text-To-Image procedure


In 1544 Giorgio Vasari from Arezzo, a Mannerist painter and historian, was commissioned 
to decorate the vault of the Sacristy of Sant'Anna dei Lombardi in Naples and did so by 
inserting grotesque decorations. The interesting aspect for this research is the description of
these decorations in Vasari's treatise, Le Vite, considered the first modern treatise in art 
history. An accurate analysis of the decorative apparatus can be found in [Miele et al. 2022] 
(fig. 2). 

The possibility of comparing the description of the text on the grotesque apparatus with 
the images of the actual decorations, acquired with a digital photogrammetric survey, has 
allowed the application of the previously described procedure. The operations are carried 
out on the vault, which has anthropomorphic and phytomorphic decorations. Once the 
semantic units in the description of the treatise were identified, the neural networks were 
applied for the production of images from the text. For the first two images, the comparison
between the original text, written in sixteenth-century language, and natural language
generated only one level of arbitrariness in the original-natural-English language
flow. In the third image, on the other hand, which is based on words that are currently
obsolete, such as sciarpelloni, the level of arbitrariness increases (fig. 3).
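
One possible way of keeping track of the levels of arbitrariness along the original-natural-English flow is sketched below. The scoring rule is an assumption made for illustration and is not the authors' formal definition: one level is counted for the translation into English, and one more when the original word has no identical form in today's natural language. The class name `Keyword` is hypothetical. No modern or English equivalent is proposed for the obsolete word, which is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Keyword:
    original: str            # Kws_o: word as written in the treatise
    natural: Optional[str]   # Kws_n: today's form; None if the operator must interpret it
    english: Optional[str]   # English rendering; None if still to be chosen

    def arbitrariness(self) -> int:
        """Count the passes that need an interpretive choice (assumed scoring)."""
        level = 0
        if self.natural is None or self.natural != self.original:
            level += 1       # original -> natural language
        level += 1           # natural -> English: translation is counted as one choice
        return level


keywords = [
    Keyword("cavallo", "cavallo", "horse"),
    Keyword("gambe di foglie", "gambe di foglie", "leaves legs"),
    Keyword("sciarpelloni", None, None),  # obsolete word: interpretation required
]
levels = {k.original: k.arbitrariness() for k in keywords}
```

Under this assumed scoring, the keywords of the first image stay at one level, while the obsolete term reaches two.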

The first elaboration concerns the image of the ‘horse with legs made of leaves’. Various
attempts were necessary for the lexical and syntactic choices, modifying the prompt several
times (fig. 4). The application pointed out some network limitations. Notably, the Stable
Diffusion neural network trained with the LAION database exhibits failure rates of 25% on
animal limb generation. In order to obtain a result more similar to reality, we finally proceeded
with inpainting, i.e. partial training of the network with specific reference images chosen
by the operator (fig. 5).
For the second sentence of the treatise (man-legs-crane), the A.I. handles the text as
shown in figure 6. In particular, many images are inadequate because the network censors various
proposals on account of the implicit request for nudity. Furthermore, it is observed that,
without forced network training and an inpainting procedure, the images are far from
Vasari’s representations.
Several attempts have been made for the third sentence of the treatise, three of which are 
shown in figure 7. The network has substantial limitations in managing words that express
quantities (‘infinite’) and, above all, in managing multiple degrees of arbitrariness, which
distance the image from the desired results.

If the level of arbitrariness is 1, depending on the operator's interpretation, the image is
very close to the real one. As the levels of arbitrariness increase, the final image moves
further away from the real one.
Fig. 2. Digital photogrammetric model of the sacristy and description of the iconographic
apparatus of the central vault. Graphic elaboration by the authors.

Fig. 3. The procedure for choosing semantic units (kw_o), comparing the original language and
the natural one (kw_n), and translating words into English. Each pass generates levels of
arbitrariness that affect network training. Elaboration by the authors.

Fig. 4. Processing attempts of the first image, ‘on a horse the legs of leaves’, with the
open-source A.I. Stable Diffusion, modifying the network setting and the prompt. Elaboration by
the authors.

Fig. 5. Adequate result: a) Original photo entered by the operator in the network;
b) ‘Inpainting’ with A.I. processing. Elaboration by the authors.

Fig. 6. A.I. elaborations on the second chosen sentence of the treatise. Graphic elaboration by
the authors.
Prompt            <Horse with leaves legs>
Aspect ratio      Square (1:1)
Steps             High (50)
Guidance scale    Normal (7.5)
Seed              7674627867078122
Negative prompt   -

Prompt            <Horse with leaves legs, fresco>
Aspect ratio      Square (1:1)
Steps             Extreme (100)
Guidance scale    Very strict (17.5)
Seed              7909556553530120
Negative prompt   No perspective

Prompt            <Half woman half horse with the wings of an angel and with leaves legs,
                  fresco, white background>
Aspect ratio      Square (1:1)
Steps             Extreme (100)
Guidance scale    Very strict (17.5)
Seed              6527961537147399
Negative prompt   No perspective

Inpainting: user photo
Prompt            <horse with leaves legs, fresco>
Aspect ratio      Square (1:1)
Steps             High (50)
Guidance scale    Normal (7.5)
Seed              4973245040653061
Negative prompt   No perspective
“[...] Le grottesche sono una spezie di pittura licenziose e ridicole molto, fatte
dagl’antichi per ornamenti di vani, dove in alcuni luoghi non stava bene altro che
cose in aria; per il che facevano in quelle tutte sconciature di monstri per 
strattezza della natura e per gricciolo e ghiribizzo degli artefici, i quali fanno in 
quelle cose senza alcuna regola, apiccando a un sottilissimo filo un peso che 
non si può reggere, a un cavallo le gambe di foglie, a un uomo le gambe di 
gru, et infiniti sciarpelloni e passerotti; e chi più stranamente se gli 
immaginava, quello era tenuto più valente [...]”. (Vasari, 1986, p. 73) 
Prompt            <Woman/horse leaves legs with the wings of an angel, fresco>
Aspect ratio      Tablet (2:3)
Steps             High (50)
Guidance scale    Normal (7.5)
Seed              5431238313039876
Negative prompt   No perspective


Fig. 7. A.I. elaborations on the third chosen sentence of the treatise, with multiple levels of
arbitrariness. Graphic elaboration by the authors.


Conclusion 


Since ancient times, the debate on the relationship between text and image has continu- 
ously developed exciting themes, which have taken on different connotations depending 
on the knowledge developed and the tools used. Currently, the study of the relationship 
converges in the discipline called visual culture, defined by Mitchell as ‘interdisciplinary’ or 
rather ‘indiscipline’, thus including numerous specificities, all of which are fundamental for 
understanding the connections between texts and images. Starting from this assumption, 
the paper presents the first results of a new analysis of the image-text relationship based 
on using the most recent A.I. neural network technology. The goal is to investigate network
training to create images derived from treatises, which require in-depth knowledge of ‘text 
architecture’. The first results highlight the numerous criticalities linked to the historicity of 
language and the imprinting of the image construction. They also suggest that the process 
produces much more realistic results if the components of the visual language, which
determine the compositional structure of the image, are also included in the training of the
network. Future research will investigate this latter aspect.